Vision-and-Language Navigation



Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Heng Li

Neural Information Processing Systems

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions.







History Aware Multimodal Transformer for Vision-and-Language Navigation

Neural Information Processing Systems

HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer (ViT), which first encodes individual images with ViT, then models spatial relation between images in a panoramic observation, and finally takes into account temporal relation between panoramas in the history.
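This hierarchical design reads naturally as three stacked encoders: a per-image ViT, a panorama-level transformer over the views of one observation, and a temporal transformer over the history of panoramas. Below is a minimal PyTorch sketch of that idea, assuming a torchvision ViT-B/16 backbone and arbitrary layer counts and dimensions; HAMT's actual configuration, pretraining, and cross-modal components are not reproduced here.

```python
# Minimal sketch of hierarchical history encoding in the spirit of HAMT.
# Backbone, layer counts, dims, and pooling are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class HierarchicalHistoryEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.view_encoder = vit_b_16(weights=None)   # stage 1: per-image ViT
        self.view_encoder.heads = nn.Identity()      # keep the 768-d [CLS] feature
        pano_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.pano_encoder = nn.TransformerEncoder(pano_layer, num_layers=2)      # stage 2: spatial
        temp_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(temp_layer, num_layers=2)  # stage 3: temporal

    def forward(self, history):
        # history: (T, V, 3, 224, 224) = T past panoramas, each with V views
        T, V = history.shape[:2]
        views = self.view_encoder(history.flatten(0, 1))            # (T*V, dim)
        views = views.view(T, V, -1)
        pano = self.pano_encoder(views).mean(dim=1)                 # (T, dim): relations within a panorama
        return self.temporal_encoder(pano.unsqueeze(0)).squeeze(0)  # (T, dim): relations across panoramas

# usage with illustrative shapes (a real panorama would have 36 views)
enc = HierarchicalHistoryEncoder()
hist = torch.randn(2, 4, 3, 224, 224)   # 2 past panoramas, 4 views each
print(enc(hist).shape)                  # torch.Size([2, 768])
```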


Frequency-enhanced Data Augmentation for Vision-and-Language Navigation (Supplemental Material)
Keji He

Neural Information Processing Systems

Table 1 presents the impact of different random seeds for sampling the interference images. Experiments in the main manuscript are based on seed 1, which gives average performance. Figure 1 shows navigation examples in normal and high-frequency perturbed scenes. In the examples shown in Figure 4, both models produce similar textual attention. In Figure 6, according to the given instruction, the agent should turn left to enter the room corresponding to the second view.
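The augmentation named in the title works in the frequency domain, perturbing scenes with content from randomly sampled interference images. The sketch below illustrates one plausible version of such a high-frequency perturbation, assuming an FFT-based low/high split and a seeded random mixing strength; the function name, cutoff radius, and mixing rule are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: inject the high-frequency spectrum of an interference image
# into a navigation observation, keeping the observation's low frequencies.
import numpy as np

def high_freq_perturb(obs, interference, cutoff=0.1, seed=1):
    """obs, interference: float arrays of shape (H, W, C) with values in [0, 1]."""
    rng = np.random.default_rng(seed)   # seed controls the sampled mixing strength
    alpha = rng.uniform(0.5, 1.0)       # how much interference high-frequency to inject
    h, w = obs.shape[:2]
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2
    # Low-frequency mask: a centered disc in the shifted spectrum.
    low = ((yy - cy) ** 2 + (xx - cx) ** 2) <= (cutoff * min(h, w)) ** 2
    out = np.empty_like(obs)
    for c in range(obs.shape[2]):
        f_obs = np.fft.fftshift(np.fft.fft2(obs[..., c]))
        f_int = np.fft.fftshift(np.fft.fft2(interference[..., c]))
        # Keep the observation's low frequencies; blend the high frequencies.
        f_mix = np.where(low, f_obs, (1 - alpha) * f_obs + alpha * f_int)
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(f_mix)))
    return np.clip(out, 0.0, 1.0)

# usage with random stand-in images
obs = np.random.rand(224, 224, 3)
interference = np.random.rand(224, 224, 3)
print(high_freq_perturb(obs, interference, seed=1).shape)  # (224, 224, 3)
```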